fix(query): Remove duplicated rows when datasets shared Publicly in the hub page#6016
fix(query): Remove duplicated rows when datasets shared Publicly in the hub page#6016Mrudhulraj wants to merge 11 commits into
Conversation
|
👋 Thanks for opening this pull request, @Mrudhulraj! It looks like the pull request description doesn't quite follow our template yet:
Filling out the template helps reviewers understand and triage your contribution faster. Please edit the description to complete it. This message will disappear automatically once the template is followed. You can find the template prompts by editing the description, or see CONTRIBUTING.md for the full contribution flow. |
Automated Reviewer SuggestionsBased on the
|
|
Add some specs to both PRs |
|
Thanks, this is much more focused now. The core fix looks right to me: filtering I have two requests:
Optional cleanup: the condition rewrite is logically okay, but the PR could be smaller if we keep the existing WHERE logic and only change the JOIN. Also, the new comments have a couple typos and “join is skipped” is slightly misleading because the join remains present but is forced to match no rows when |
|
I have fixed it and added appropriate comments. |
|
|
| config | throughput | MB/s | latency | max Δ latest / 7d | |
|---|---|---|---|---|---|
| 🔴 | bs=10 sw=10 sl=64 | 377 | 0.23 | 25,241/41,163/41,163 us | 🔴 +21.5% / 🔴 +173.1% |
| 🟢 | bs=100 sw=10 sl=64 | 810 | 0.495 | 123,125/137,036/137,036 us | 🟢 -14.0% / 🔴 +27.4% |
| ⚪ | bs=1000 sw=10 sl=64 | 932 | 0.569 | 1,064,666/1,125,263/1,125,263 us | ⚪ within ±5% / 🔴 +9.4% |
Baseline details
Latest main 1a58433 from same runner
| config | metric | PR | latest main | 7d avg | Δ latest | Δ 7d |
|---|---|---|---|---|---|---|
| bs=10 sw=10 sl=64 | throughput | 377 tuples/sec | 405 tuples/sec | 777.62 tuples/sec | -6.9% | -51.5% |
| bs=10 sw=10 sl=64 | MB/s | 0.23 MB/s | 0.247 MB/s | 0.475 MB/s | -6.9% | -51.5% |
| bs=10 sw=10 sl=64 | p50 | 25,241 us | 23,715 us | 12,612 us | +6.4% | +100.1% |
| bs=10 sw=10 sl=64 | p95 | 41,163 us | 33,887 us | 15,070 us | +21.5% | +173.1% |
| bs=10 sw=10 sl=64 | p99 | 41,163 us | 33,887 us | 18,360 us | +21.5% | +124.2% |
| bs=100 sw=10 sl=64 | throughput | 810 tuples/sec | 797 tuples/sec | 988.31 tuples/sec | +1.6% | -18.0% |
| bs=100 sw=10 sl=64 | MB/s | 0.495 MB/s | 0.486 MB/s | 0.603 MB/s | +1.9% | -17.9% |
| bs=100 sw=10 sl=64 | p50 | 123,125 us | 123,089 us | 101,066 us | +0.0% | +21.8% |
| bs=100 sw=10 sl=64 | p95 | 137,036 us | 159,269 us | 107,594 us | -14.0% | +27.4% |
| bs=100 sw=10 sl=64 | p99 | 137,036 us | 159,269 us | 115,830 us | -14.0% | +18.3% |
| bs=1000 sw=10 sl=64 | throughput | 932 tuples/sec | 918 tuples/sec | 1,019 tuples/sec | +1.5% | -8.6% |
| bs=1000 sw=10 sl=64 | MB/s | 0.569 MB/s | 0.56 MB/s | 0.622 MB/s | +1.6% | -8.6% |
| bs=1000 sw=10 sl=64 | p50 | 1,064,666 us | 1,085,862 us | 986,982 us | -2.0% | +7.9% |
| bs=1000 sw=10 sl=64 | p95 | 1,125,263 us | 1,139,586 us | 1,028,491 us | -1.3% | +9.4% |
| bs=1000 sw=10 sl=64 | p99 | 1,125,263 us | 1,139,586 us | 1,058,493 us | -1.3% | +6.3% |
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,529.80,200,128000,377,0.230,25241.47,41162.51,41162.51
1,100,10,64,20,2468.11,2000,1280000,810,0.495,123124.61,137036.14,137036.14
2,1000,10,64,20,21459.36,20000,12800000,932,0.569,1064666.45,1125262.78,1125262.78…n the hub page fix(dashboard): Remove duplicated rows when dataset shared Publicly in the hub page
…che#6015) ### What changes were proposed in this PR? `AGENTS.md` documents the single-node and k8s deploy paths but never tells an agent how to run the local dev stack. This PR: - adds a "Where Things Live" row routing to `bin/local-dev.sh` — the single entry point (infra in Docker; backend, frontend, and agent-service run natively); - expands the "Develop in a worktree" section to: - prefer `bin/local-dev.sh` while developing; - bounce the stack across worktree switches (`down` in the old worktree, `up` in the new one), since the native services bind fixed ports and share one PID/state dir, so only one worktree's stack runs at a time; - use the non-interactive CLI subcommands — the interactive TUI (`-i`) is for humans, not agents. - drops "Local" from the single-node / k8s deploy row — both can deploy anywhere, not only locally. Only the wrapper is surfaced, never the internal `bin/local-dev/main.sh`. ### Any related issues, documentation, discussions? Closes apache#6014 ### How was this PR tested? Docs-only change; no tests. Verified the linked `bin/local-dev/README.md` path resolves and that the described subcommands (`up` / `down` / `status` / `logs` / `-i`) match `bin/local-dev.sh --help`. ### Was this PR authored or co-authored using generative AI tooling? Generated-by: Claude Code (Claude Opus 4.8)
…omote to errors (apache#5985) ### What changes were proposed in this PR? Fix the two remaining Angular extended-diagnostic warnings surfaced by the frontend build, add a unit test for the affected logic, and promote both checks to build errors so regressions can't silently creep back in: - **NG8113** — `CardItemComponent` listed `NzWaveDirective` in its standalone `imports` array but never used it in the template. Removed the unused directive import. - **NG8107** — `computing-unit-selection.component.html` used `pve.name?.trim()`, but `PveDraft.name` is a non-nullable `string`, so the optional chaining is redundant. Extracted the predicate into an `isCreateDisabled(pve)` component method bound via `[disabled]="isCreateDisabled(pve)"`, and added unit tests covering empty / whitespace / valid / padded names (the create button stays disabled until the env name has non-whitespace content). - **Enforcement** — set `unusedStandaloneImports` (NG8113) and `optionalChainNotNullable` (NG8107) to `"error"` under `angularCompilerOptions.extendedDiagnostics.checks`. Placed in `tsconfig.app.json` (the prod build, `build:ci`) rather than the base tsconfig, because `extendedDiagnostics` requires `strictTemplates`, which the unit-test builds deliberately disable; `tsconfig.test.json` clears the inherited option to avoid `NG4003`. No runtime behavior change. ### Any related issues, documentation, discussions? Closes apache#5984 ### How was this PR tested? - `ng test` (the affected spec): **46 passed**, including the 4 new `isCreateDisabled` cases. - `ng build`: succeeds with **0** NG8113/NG8107 diagnostics (now configured as errors) and produces the normal bundle. ### Was this PR authored or co-authored using generative AI tooling? Generated-by: Claude Code (Claude Opus 4.8) --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rkflowExecutionService (apache#5922) ### What changes were proposed in this PR? Consolidates execution init-error reporting into a **single site owned by `WorkflowExecutionService`**, replacing the two-phase split apache#5781 introduced (per @Yicong-Huang's suggestion, tracked in apache#5921). - `WorkflowExecutionService` now registers its `fatalErrors → WorkflowErrorEvent` (and state) diff handler as the **first** construction action. Construction itself does no external work and cannot throw — it only assigns `workflowSettings`, creates a `WebsocketInput` (a `PublishSubject`), and registers the handler. All throwing work lives in `executeWorkflow()`, which runs **after** `executionService.onNext(...)` publishes the execution, so any failure there is recorded by `errorHandler` into the metadata store and surfaced by the handler that `connectToExecution` forwards. - `WorkflowService.initExecutionService` drops the pre-publish `errorSubject` fallback: the `executionPublished` gating and `reportFatalErrorsToSubscribers` are gone, and the catch is simply `errorHandler(e)`. The now-unused `errorSubject` field and its `connect()` subscription are removed. - `WorkflowServiceSpec` is removed — it only tested the deleted `reportFatalErrorsToSubscribers`; the surfacing behavior is exercised by the integration/e2e suites. Net: a single reporting path (the metadata-store diff handler), with no change to the init-error surfacing apache#5781 added — construction is provably side-effect-free, so the pre-publish window no longer has a failure mode. ### Any related issues, documentation, discussions? Resolves apache#5921 — the follow-up refactor agreed during the apache#5781 review. ### How was this PR tested? `scalafmtCheckAll` clean. This is a behavior-preserving refactor of init-error reporting: the single reporting path is the metadata-store diff handler that already surfaced post-publish errors, and construction is side-effect-free so no error can escape it. The full compile, scalafix, and the integration/e2e suites that exercise init-error surfacing run in CI. ### Was this PR authored or co-authored using generative AI tooling? Co-authored with Claude Opus 4.8 in compliance with ASF. --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
### What changes were proposed in this PR? The Codecov patch status was `informational: true`, so it never gated PRs. This makes it a real requirement: - set `target: 60%` on `coverage.status.patch.default`; - add `threshold: 10%` so the patch passes when coverage is at least `target - threshold` = 50%, softening the quantization on tiny diffs (e.g. a 2-coverable-line change passes at 1 line / 50%); - drop the `informational` flag so the patch check can fail. PRs must now cover at least 50% of the lines they change (60% target, 10% slack). Project status (auto target, 1% threshold), flag carryforward, and ignore rules are unchanged. The explanatory comment in `codecov.yml` is updated to match. ### Any related issues, documentation, discussions? Closes apache#6021 ### How was this PR tested? Config-only change; `codecov.yml` validated as well-formed YAML. Codecov applies the new policy on the next PR upload. ### Was this PR authored or co-authored using generative AI tooling? Generated-by: Claude Code (Claude Opus 4.8)
…nt (apache#5949) <!-- Thanks for sending a pull request (PR)! Here are some tips for you: 1. If this is your first time, please read our contributor guidelines: [Contributing to Texera](https://github.com/apache/texera/blob/main/CONTRIBUTING.md) 2. Ensure you have added or run the appropriate tests for your PR 3. If the PR is work in progress, mark it a draft on GitHub. 4. Please write your PR title to summarize what this PR proposes, we are following Conventional Commits style for PR titles as well. 5. Be sure to keep the PR description updated to reflect all changes. --> ### What changes were proposed in this PR? <!-- Please clarify what changes you are proposing. The purpose of this section is to outline the changes. Here are some tips for you: 1. If you propose a new API, clarify the use case for a new API. 2. If you fix a bug, you can clarify why it is a bug. 3. If it is a refactoring, clarify what has been changed. 3. It would be helpful to include a before-and-after comparison using screenshots or GIFs. 4. Please consider writing useful notes for better and faster reviews. --> > [Stacked on apache#5947] This PR consolidates the public and private dataset listings to use the shared CardItemComponent. - Replace texera-dataset-card-item with the shared read-only texera-card-item in the public hub. - Remove the unused DatasetCardItemComponent and its tests. - Align the hub list/card toggle with the workflow and private dataset listings. - Update the like button styling on the shared card. This keeps dataset cards consistent across all listings and reduces duplicated code. Demo: <img width="1258" height="400" alt="public-datasets" src="https://github.com/user-attachments/assets/49a083f1-d2a4-47f5-99a4-d70f90ccd947" /> ### Any related issues, documentation, discussions? <!-- Please use this section to link other resources if not mentioned already. 1. If this PR fixes an issue, please include `Fixes apache#1234`, `Resolves apache#1234` or `Closes apache#1234`. If it is only related, simply mention the issue number. 2. If there is design documentation, please add the link. 3. If there is a discussion in the mailing list, please add the link. --> Closes apache#5948 ### How was this PR tested? <!-- If tests were added, say they were added here. Or simply mention that if the PR is tested with existing test cases. Make sure to include/update test cases that check the changes thoroughly including negative and positive cases if possible. If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future. If tests were not added, please describe why they were not added and/or why it was difficult to add. --> All existing tests passed, and the changes were manually tested. ### Was this PR authored or co-authored using generative AI tooling? <!-- If generative AI tooling has been used in the process of authoring this PR, please include the phrase: 'Generated-by: ' followed by the name of the tool and its version. If no, write 'No'. Please refer to the [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) for details. --> Generated-by: Claude Opus 4.8
…nent (apache#5566)⚠️ This PR is stacked on apache#5574. Until that lands, the diff below may also include PR 5's QA/ranking task changes depending on which base GitHub is showing. The new code in this PR is the HuggingFaceComponent (task selector + model browser) under frontend/src/app/workspace/component/hugging-face/, plus the formly registration in formly-config.ts and the declaration in app.module.ts. Once PR apache#5574 merges and this PR is retargeted to main, the diff should auto-clean to the PR 6a frontend selector changes only. ### What changes were proposed in this PR? Add `HuggingFaceComponent`, a custom formly field type (`huggingface`) that provides: - A task dropdown listing all supported HuggingFace inference tasks (fetched from the Texera backend's `/huggingface/tasks` endpoint, with a static fallback list) - A paginated model list with client-side search, fetched from the Texera backend's `/huggingface/models` endpoint (which proxies HuggingFace Hub) - Per-task field state preservation — when switching tasks, previously entered values (modelId, promptColumn, etc.) are saved and restored This PR registers the component in `formly-config.ts` and declares it in `AppModule`. The component is not yet wired into the HuggingFace operator's property editor; the `jsonSchemaMapIntercept` mapping that routes the `modelId` field to this component is added in the follow-up property-editor PR (PR 7). ### Any related issues, documentation, discussions? - Tracking issue: apache#5314 - Closes: apache#5314 - Stacked on: PR tracked in issue apache#5292 - Parent issue: apache#5041 ### How was this PR tested? 7 unit tests added in `hugging-face.component.spec.ts` covering: - Static task list is non-empty and contains expected tasks (text-generation, image tasks, audio tasks, QA/ranking tasks) - Task tags are unique - Cache invalidation does not throw Run with `ng test`. ### Was this PR authored or co-authored using generative AI tooling? Co-authored with Claude Opus 4.7 --------- Signed-off-by: Elliot Lin <36275109+ELin2025@users.noreply.github.com> Co-authored-by: Elliot <36275109+Falcons-Royale@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
### What changes were proposed in this PR? Add `HuggingFaceAudioUploadComponent`, a custom formly field type (`huggingface-audio-upload`) that provides: - An audio file picker that uploads to the Texera backend's `/huggingface/upload-audio` endpoint for server-side storage - Authenticated audio preview/playback — fetches server-stored audio via `HttpClient` (which carries the JWT) and creates a blob URL for the `<audio>` element, avoiding 401s on workflow reload - Local audio preview using a browser Object URL while upload is in progress - Stale response guards — discards upload results and preview fetches if the user clears the field while a request is in flight - Concurrent upload protection — disables the file input and Clear button during upload - Storage of the returned server path in the formly form control - A guidance message explaining why audio is uploaded to backend storage rather than embedded in the workflow JSON This PR also registers `HuggingFaceComponent` and `HuggingFaceAudioUploadComponent` in `formly-config.ts` and declares them in `AppModule`. The `jsonSchemaMapIntercept` mapping that routes the `audioInput` field to this component is added in the follow-up property-editor PR (PR 7). ### Any related issues, documentation, discussions? - Tracking issue: apache#5314 - Closes: apache#5314 - Stacked on: apache#5566 - Parent issue: apache#5041 ### How was this PR tested? 38 unit tests in `hugging-face-audio-upload.component.spec.ts` covering: - `ngOnInit` — filename extraction from server paths, data URLs, empty/whitespace values - `previewSrc` — returns data URLs as-is, empty for server paths (loaded async), empty for blank values - `onFileSelected` — successful upload, non-audio rejection, concurrent upload guard, no-file guard, upload-in-progress state, error handling, filename fallback, model updates, Content-Type/URL encoding - `loadServerAudioPreview` — successful blob fetch, fetch failure error message, stale response discard on both success and error, no-fetch for data URLs, URL encoding - `clearAudio` — full state reset, error preservation, model clearing, dirty/touched marking - Stale upload guards — discard success/error when cleared during flight - `ngOnDestroy` — blob URL revocation - `getDisplayName` — forward/backslash paths, trailing separator, flat filename Run with `ng test`. ### Was this PR authored or co-authored using generative AI tooling? Co-authored with Claude Opus 4.6 --------- Co-authored-by: Elliot <36275109+Falcons-Royale@users.noreply.github.com> Co-authored-by: Anish Shivamurthy <anish@uci.edu> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e#6025) ### What changes were proposed in this PR? The usage banner in `bin/local-dev/main.sh` claimed Docker infra is **not** managed by the script and that you must bring it up yourself before `up`. That contradicts the actual code: `cmd_up` calls `infra_up` (which runs `docker compose up -d` against project `texera-local-dev`) and `cmd_down` calls `infra_down` for any docker targets. ``` Before: comment says "infra is NOT managed here -- bring those up yourself" After: comment says "infra IS managed: `up` brings it up, `down` tears it down" ``` Comment-only change; no behavior change. ### Any related issues, documentation, discussions? Related to apache#5960 (introduced `bin/local-dev` and this banner) and apache#6007 (wired `infra_up` into `up`/`auto`, which is what made the banner stale). No dedicated issue — documentation/comment-only fix (issue-first is excepted for docs). The sibling `bin/local-dev/README.md` already documents the correct behavior ("infra in docker, JVM + frontend native"); this just realigns the script's own banner. ### How was this PR tested? No tests — comment-only change. Verified the described behavior by reading `cmd_up` → `infra_up` (`docker compose up -d`, project `texera-local-dev`) and `cmd_down` → `infra_down`, and by running `bin/local-dev.sh up`, which brought up postgres/minio/lakefs/lakekeeper/litellm automatically (14/14 services healthy) with no manual `docker compose` step. ### Was this PR authored or co-authored using generative AI tooling? Generated-by: Claude Code (Opus 4.8)
…at (apache#6027) ### What changes were proposed in this PR? Two small, agent/script-friendly additions to `bin/local-dev.sh` — no change to the human TTY experience. **1. `--json` machine-readable status.** `status --json` prints one JSON object on stdout (no colours, no table) and exits `0` iff every service is running, else `1`: ```json {"branch":"...","sha":"...","running":14,"total":14,"services":[ {"service":"texera-web","port":8080,"type":"jvm","pid":25823,"state":"running"}, ...]} ``` `up` and `down` also accept `--json`: human progress is routed to **stderr** (unbuffered) and the final status JSON goes to **stdout** (via a saved fd), so a caller can `up --json >state.json 2>progress.log` and parse stdout directly. **2. Non-TTY build heartbeat.** In non-TTY mode the spinner can't render in place, so a long silent step (`sbt dist`, output redirected to a log) used to print one line then go quiet for 25s+ — indistinguishable from "stuck" to a non-interactive caller. `tui_spinner` now emits `… still running (Ns)` every `TUI_HEARTBEAT_SECS` (default 15), polling at 1s so it still returns within ~1s of the job finishing (no trailing latency). ### Any related issues, documentation, discussions? Closes apache#6026. Usage banner (`--help`) updated to document `--json` on `status`/`up`/`down`. ### How was this PR tested? - `bash bin/local-dev/tests/test_local_dev_sh.sh` → **19 passed, 0 failed** (added 6: JSON shape/consistency, health-based exit code, unknown-flag negative case, `--help` coverage, heartbeat regression guard, up/down `--json` wiring). - `python -m pytest bin/local-dev/tests/` → **38 passed** (no regression). - Dogfooded end to end: `down --json` → stdout pure JSON (`running:0`, 14 `stopped`), exit 0; `up --json` → single-line JSON `running:14/14` on stdout, and stderr showed the new heartbeat across a ~60s build (`… still running (15s/30s/45s/60s)`), 14/14 healthy. - Verified stdout/stderr separation: `up --json >state.json 2>progress.log` yields parseable JSON in `state.json`. ### Was this PR authored or co-authored using generative AI tooling? Generated-by: Claude Code (Opus 4.8)
a7bb653 to
f65e408
Compare
|
I did not see testing of this PR, could you please add it? |
|
@carloea2 can you guide me here? I just did some manual testing and database checks here. |
|
Manual checks are useful, but this needs an automated DB-backed regression spec. Take a look at: That spec already uses Please add a regression test under
|
|
Thanks @Mrudhulraj and @carloea2. Let's make sure we have test coverage on the code change. |
What changes were proposed in this PR?
Issue - Duplicate datasets on hub landing page / hub search
Symptom: A user creates a dataset, makes it public, and grants another user explicit access. When the grantee browses the hub, the dataset appears twice in the search results.
Root cause:
DatasetSearchQueryBuilder.constructFromClause produced this SQL:
path:
amber\src\main\scala\org\apache\texera\web\resource\dashboard\DatasetSearchQueryBuilder.scala:72For a dataset that is both public AND explicitly shared with the user, the LEFT JOIN produces one row per matching dataset_user_access row and the OR makes both branches true.
This applies similarly to worflows too.
Fix 1 — DatasetSearchQueryBuilder.constructFromClause
Move the UID filter from the WHERE clause into the JOIN's ON clause so each dataset produces at most one joined row, and force the JOIN to FALSE when uid == null so the SELECT still references a valid
table.
Why
ANDFALSEforuid == null?The
SELECTreferencesDATASET_USER_ACCESS.PRIVILEGE. Withoutdataset_user_accessin theFROM, DB throws missingFROM-clause entry for table "dataset_user_access".
ANDFALSEkeeps the table in the FROM while making the JOIN yield NULL access columns — which is the correct semantic for "no explicit grant".Behavior matrix:
Any related issues, documentation, discussions?
Refs #5957
How was this PR tested?
Tested manually with database checks and UI workflow testing.
Was this PR authored or co-authored using generative AI tooling?
No AI tools were used in the process.